infinite data
When Additive Noise Meets Unobserved Mediators: Bivariate Denoising Diffusion for Causal Discovery
Distinguishing cause and effect from bivariate observational data is a foundational problem in many disciplines, but challenging without additional assumptions. Additive noise models (ANMs) are widely used to enable sample-efficient bivariate causal discovery. However, conventional ANM-based methods fail when unobserved mediators corrupt the causal relationship between variables. This paper makes three key contributions: first, we rigorously characterize why standard ANM approaches break down in the presence of unmeasured mediators. Second, we demonstrate that prior solutions for hidden mediation are brittle in finite sample settings, limiting their practical utility. To address these gaps, we propose Bivariate Denoising Diffusion (BiDD) for causal discovery, a method designed to handle latent noise introduced by unmeasured mediators. Unlike prior methods that infer directionality through mean squared error loss comparisons, our approach introduces a novel independence test statistic: during the noising and denoising processes for each variable, we condition on the other variable as input and evaluate the independence of the predicted noise relative to this input. We prove asymptotic consistency of BiDD under the ANM, and conjecture that it performs well under hidden mediation. Experiments on synthetic and real-world data demonstrate consistent performance, outperforming existing methods in mediator-corrupted settings while maintaining strong performance in mediator-free settings.
When Additive Noise Meets Unobserved Mediators: Bivariate Denoising Diffusion for Causal Discovery
Meier, Dominik, Hiremath, Sujai, Ghosal, Promit, Gan, Kyra
Distinguishing cause and effect from bivariate observational data is a foundational problem in many disciplines, but challenging without additional assumptions. Additive noise models (ANMs) are widely used to enable sample-efficient bivariate causal discovery. However, conventional ANM-based methods fail when unobserved mediators corrupt the causal relationship between variables. This paper makes three key contributions: first, we rigorously characterize why standard ANM approaches break down in the presence of unmeasured mediators. Second, we demonstrate that prior solutions for hidden mediation are brittle in finite sample settings, limiting their practical utility. To address these gaps, we propose Bivariate Denoising Diffusion (BiDD) for causal discovery, a method designed to handle latent noise introduced by unmeasured mediators. Unlike prior methods that infer directionality through mean squared error loss comparisons, our approach introduces a novel independence test statistic: during the noising and denoising processes for each variable, we condition on the other variable as input and evaluate the independence of the predicted noise relative to this input. We prove asymptotic consistency of BiDD under the ANM, and conjecture that it performs well under hidden mediation. Experiments on synthetic and real-world data demonstrate consistent performance, outperforming existing methods in mediator-corrupted settings while maintaining strong performance in mediator-free settings.
Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit
Filatov, Oleg, Ebert, Jan, Wang, Jiangtao, Kesselheim, Stefan
One of the main challenges in optimal scaling of large language models (LLMs) is the prohibitive cost of hyperparameter tuning, particularly learning rate $\eta$ and batch size $B$. While techniques like $\mu$P (Yang et al., 2022) provide scaling rules for optimal $\eta$ transfer in the infinite model size limit, the optimal scaling behavior in the infinite data size limit remains unknown. We fill in this gap by observing for the first time an intricate dependence of optimal $\eta$ scaling on the pretraining token budget $T$, $B$ and its relation to the critical batch size $B_\mathrm{crit}$, which we measure to evolve as $B_\mathrm{crit} \propto T$. Furthermore, we show that the optimal batch size is positively correlated with $B_\mathrm{crit}$: keeping it fixed becomes suboptimal over time even if learning rate is scaled optimally. Surprisingly, our results demonstrate that the observed optimal $\eta$ and $B$ dynamics are preserved with $\mu$P model scaling, challenging the conventional view of $B_\mathrm{crit}$ dependence solely on loss value. Complementing optimality, we examine the sensitivity of loss to changes in learning rate, where we find the sensitivity to decrease with increase of $T$ and to remain constant with $\mu$P model scaling. We hope our results make the first step towards a unified picture of the joint optimal data and model scaling.
Offline Policy Evaluation and Optimization under Confounding
Kausik, Chinmaya, Lu, Yangyi, Tan, Kevin, Makar, Maggie, Wang, Yixin, Tewari, Ambuj
Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies. We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value. When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients. In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.
Learning from Infinite Data in Finite Time
We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model Mii learned by the algorithm using ni examples in step i (ii (nl, ...,nm)), and the model Moo that would be learned using in(cid:173) finite examples. Upper-bound the loss L(Mii' M oo) between them as a function of ii, and then minimize the algorithm's time com(cid:173) plexity f(ii) subject to the constraint that L(Moo, Mii) be at most f with probability at most 8. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach.
SEERIST releases white paper on Turning Infinite data into Insightful risk and threat Strategies
Outlines how augmented analytics changes the way security, operations and risk professionals navigate or prevent potential risks before they happen. Seerist Inc., the leading augmented analytics solution for threat and security professionals, today announced the availability of its white paper, Turning Infinite Data into Insightful Threat and Risk Strategies. This white paper was written to demonstrate how leaders can better leverage global data to make more informed, strategic decisions by combining the power of machine learning, human analysis, and natural language capabilities. "Data continues to grow at an accelerated rate every year with 89 percent of big data created in the last two years. It is simply impossible for humans to adequately access and evaluate the vast quantum of information available, yet the value it provides can be life changing and should not be ignored," said Jim Brooks, Seerist's CEO.
Neural Networks: An Introduction--Wolfram Blog
If you haven't used machine learning, deep learning and neural networks yourself, you've almost certainly heard of them. You may be familiar with their commercial use in self-driving cars, image recognition, automatic text completion, text translation and other complex data analysis, but you can also train your own neural nets to accomplish tasks like identifying objects in images, generating sequences of text or segmenting pixels of an image. With the Wolfram Language, you can get started with machine learning and neural nets faster than you think. Since deep learning and neural networks are everywhere, let's go ahead and explore what exactly they are and how you can start using them. Neural networks are a programming approach that is inspired by the neurons in the human brain and that enables computers to learn from observational data, be it images, audio, text, labels, strings or numbers.
A Few Useful Things to Know about Machine Learning.md
The paper presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks. Representation for a learner is the set if classifiers/functions that can be possibly learnt. This set is called hypothesis space. If a function is not in hypothesis space, it can not be learnt. Evaluation function tells how good the machine learning model is.